Abstract

We consider the problem of boosting the accuracy of weak learning algorithms in the agnostic learning 
framework of Haussler (1992) and Kearns et al. (1992). Known algorithms for this problem (Ben-David 
et al., 2001; Gavinsky, 2002; Kalai et al., 2008) follow the same strategy as boosting algorithms in the
PAC model: the weak learner is executed on the same target function but over different distributions on
the domain. Application of such boosting algorithms usually requires distribution-independent weak
agnostic learners. Here we demonstrate boosting algorithms for the agnostic learning framework that
only modify the distribution on the labels of the points (or, equivalently, modify the target function). 
This allows boosting a distribution-specific weak agnostic learner to a strong agnostic learner with respect 
to the same distribution. Our algorithm achieves the same guarantees on the final error as the boosting 
algorithms of Kalai et al. (2008) but is substantially simpler and more efficient. 

When applied to the weak agnostic parity learning algorithm of Goldreich and Levin (1989) our 
algorithm yields a simple PAC learning algorithm for DNF and an agnostic learning algorithm for decision
trees over the uniform distribution using membership queries. These results substantially simplify
Jackson's famous DNF learning algorithm (1994) and the recent result of Gopalan et al. (2008). 

We also strengthen the connection to hard-core set constructions discovered by Klivans and Servedio 
(1999) by demonstrating that hard-core set constructions that achieve the optimal hard-core set size 
(given by Holenstein (2005) and Barak et al. (2009)) imply distribution-specific agnostic boosting
algorithms. Conversely, our boosting algorithm gives a simple hard-core set construction with an (almost)
optimal hard-core set size. 

1 Introduction 

A boosting algorithm is a technique for combining the outputs of one or more learning algorithms of low but
non-trivial accuracy to obtain a hypothesis of high(er) accuracy. Since its introduction by Schapire [31]
in Valiant's PAC learning model [33], boosting has become one of the most studied areas in theoretical and
applied machine learning and also one of the tools widely used in practice.

While numerous boosting algorithms are known that can boost the accuracy of a weak PAC learner
[26] to an arbitrarily high value, very few boosting algorithms can provably improve the accuracy in the
presence of noisy or inconsistent data.¹

A natural model of learning without the PAC model assumptions on the target function is the 
agnostic learning model of Haussler [16] and Kearns, Schapire and Sellie [25]. The goal of an agnostic
learning algorithm for a concept class C is to produce, for any distribution on examples, a hypothesis h 
whose error on the distribution is close to the best possible by a concept from C . This model reflects 
a common empirical approach to learning, where few or no assumptions are made on the process that 
generates the examples and a limited space of candidate hypothesis functions is searched in an attempt 
to find the best approximation to the given data. 



¹ A number of boosting algorithms were designed specifically to address the suboptimal performance of early boosting
algorithms on noisy data. However, they were either still analyzed in the noiseless PAC model (e.g. [10]) or only tested
empirically.
The problem of boosting the accuracy of a weak learner in the agnostic learning framework was
first considered by Ben-David, Long and Mansour [2]. The weak learner given to the boosting
algorithm in their definition is a $\beta$-optimal agnostic learner, namely an agnostic learner that, for any
distribution $A$, produces a hypothesis with error $\Delta + \beta$, where $\Delta$ is the error of the best hypothesis in
$C$ (relative to $A$). Ben-David et al. described a boosting algorithm that, for a certain range of values of $\Delta$
and $\beta$, produces a hypothesis that has a lower error than the provided weak learner. In subsequent
work, Gavinsky showed that a $\beta$-optimal agnostic learner can be boosted to a learner that achieves
error $\frac{\Delta}{1/2 - \beta} + \epsilon$ in time polynomial in $1/\epsilon$ [12]. He also showed that this error is within a factor
of 2 of the best achievable for this problem.

Recently Kalai, Mansour and Verbin have examined boosting a different type of weak learner [22].
Specifically, they define an $(\alpha, \gamma)$-weak agnostic learner to be a learning algorithm that produces a
hypothesis with error at most $1/2 - \gamma$ whenever $\Delta \le 1/2 - \alpha$. Kalai et al. gave a boosting
algorithm that boosts any $(\alpha, \gamma)$-weak agnostic learner to an $(\alpha + \epsilon)$-optimal agnostic learner in time
polynomial in $1/\gamma$ and $1/\epsilon$. They have also demonstrated that such a boosting algorithm can be used
to obtain the first non-trivial distribution-independent agnostic learning algorithm for parities. Their
boosting algorithm is based on a boosting-by-branching-programs algorithm of Mansour and McAllester
[30] and its analysis by Kalai and Servedio [23].

As these agnostic boosting algorithms are based on boosting algorithms in the PAC learning framework,
they work by applying the weak learner to the target function on carefully constructed distributions
over the domain. This implies that such boosting algorithms can only be applied in the
distribution-independent setting. (One notable exception to this rule is Jackson's algorithm for learning DNF over
the uniform distribution [19], which boosts the accuracy via an ad hoc extension of the weak learner of
Blum et al. [3] to distributions that are close to uniform.)

1.1 Our Results 

We present a simple distribution-specific agnostic boosting algorithm for $(\alpha, \gamma)$-weak agnostic learners.
That is, our boosting algorithm does not modify the marginal distribution over the domain of the
learning problem but instead modifies the distribution on the label of each example.

Theorem 1.1 There exists an algorithm ABoost that for every concept class $C$ and distribution $D$ over
$X$, given an $(\alpha, \gamma)$-weak agnostic learning algorithm $\mathcal{A}$ for $C$ over $D$, agnostically and $\alpha$-optimally learns
$C$ over $D$. Further, ABoost invokes $\mathcal{A}$ $O(\gamma^{-2})$ times and runs in time $\mathrm{poly}(T, 1/\gamma, 1/\epsilon)$, where $T$ is the
running time of $\mathcal{A}$.

Our boosting algorithm implies that weak agnostic learning with respect to any specific distribution
is equivalent to (strong) agnostic learning with respect to the same distribution (see Theorem 3.2 for
the formal statement). An immediate application of this result is a simple agnostic learning algorithm
for decision trees over the uniform distribution using membership queries (see Lemma 3.3). Recently,
Gopalan, Kalai and Klivans gave the first algorithm for this problem [14]. Their proof is based on a
substantially more involved and delicate argument.

In Section 3.2 we use our boosting algorithm to extend the observation that agnostic learning of
a class $C$ implies PAC learning of low-weight linear thresholds of functions from $C$ [25, 9, 28] to a
distribution-specific setting. For a set of functions $C$ and an integer $W$, denote by $\mathrm{TH}(W, C)$ the set of all
functions representable as $\mathrm{sign}(\sum_{i \le W} f_i(x))$ where, for all $i$, $f_i \in C$.

Theorem 1.2 If $C$ is efficiently agnostically learnable with respect to distribution $D$ then $\mathrm{TH}(W, C)$ is
efficiently PAC learnable over D for any W upper-bounded by a polynomial in the learning parameters. 

An immediate application of this result is a simple proof that DNF are learnable over the uniform
distribution using membership queries [19] (we include the details in Section 3.2). It also simplifies
the analysis in many subsequent algorithms for learning DNF that use the same boosting-based
approach (e.g. [HE1GI]). In addition, this result gives a new implication of an agnostic algorithm for
learning DNF, which is posed as an open problem by Gopalan et al. [14].






We show that our boosting algorithm can also be viewed in a more traditional setting where the
boosting algorithm runs the weak learner on a modified marginal distribution but does not modify the
label distribution. In particular, in the setting of Ben-David et al. [2] our boosting algorithm achieves
the optimal error of $\frac{\Delta}{1 - 2\beta} + \epsilon$ (that is, 1/2 of the error bound achieved by Gavinsky's boosting algorithm
[12]). The details of this version are given in Section 3.1.

Boosting algorithms are also known to be closely related to hard-core set constructions [37], a
technique in hardness amplification [18]. Given a function $f$ that cannot be $\tau$-approximated on $X$ by circuits
of a certain size $s$, the goal of a hard-core set construction is to construct a sufficiently large subset of $X$ on
which $f$ cannot be $(1/2 - \gamma)$-approximated by circuits of size slightly smaller than $s$. Here we strengthen
the connection discovered by Klivans and Servedio [27] by observing that hard-core set constructions
achieving the optimal hard-core set size of $2\tau$ give agnostic boosting algorithms. The first construction
with this property was given by Holenstein, who used the construction to obtain a key agreement
protocol from a weak bit agreement primitive [17]. In a recent work Barak, Hardt and Kale demonstrated
a more efficient hard-core construction with this property [1]. Both of these constructions can be easily
translated into agnostic boosting algorithms. In addition, we show that our agnostic boosting algorithm
gives a new hard-core set construction algorithm with an almost optimal hard-core set size parameter.
Our technique for achieving the optimal hard-core set size is different from the method of Holenstein [17]
(which is also used by Barak et al. [1]) and the resulting algorithm is simpler to analyze. The relation
to hard-core set constructions is presented in Section 4.

1.2 Techniques 

Our boosting algorithms build a hypothesis $h : X \to [-1, 1]$ in steps, starting from the hypothesis $h_0 \equiv 0$.
At step $i$ the weak learner is run on points drawn randomly from the base distribution $D$ with labels
given by $(f(x) - h_i(x))/2$; that is, the expectation of the random $\{-1, 1\}$ label assigned to point $x$ is
$(f(x) - h_i(x))/2$. A weak hypothesis $g$ for this distribution satisfies $\mathbf{E}_D[((f(x) - h_i(x))/2) \cdot g(x)] \ge 2\gamma$. We
let $h'_{i+1} = h_i + \gamma \cdot g$. It is easy to see that after this step $h'_{i+1}$ is "closer" to $f$ than $h_i$ when the functions
are viewed as vectors in the appropriate Euclidean space. This argument requires the hypothesis at
each step to have range in $[-1, 1]$ and therefore we apply a projection step. Namely, $h_{i+1}$ is obtained
from $h'_{i+1}$ by cutting off all values outside the range $[-1, 1]$. This step only reduces the distance. This
algorithm and the distribution-specific view of boosting are implicit in [8], where the algorithm is used
to characterize the query complexity of statistical query (SQ) [24] learning (in the PAC and the agnostic
models) using a characterization of weak SQ learning. However, the algorithm we have described so far can
only guarantee a hypothesis with error equal to twice the optimum. To achieve the optimum we add
new "balancing" steps to the process. Namely, we test the hypothesis $-\mathrm{sign}(h_i)$ on the data distribution
produced at step $i$. If this hypothesis has non-trivial performance it is used to update $h_i$ in the same
way as the weak learner. Otherwise, it is easy to show that $\mathrm{sign}(h_i)$ has half the error of $h_i$ itself, which
is close to the optimum at the end of the boosting process.

For our application to distribution-independent boosting and hard-core set construction we also give
a stronger boosting algorithm that uses the same argument but is based on a slightly different way
of producing the distribution, together with a corresponding distance function (see Theorem 1.1).

1.3 Related Work 

Kalai and Kanade have very recently and independently demonstrated a different distribution-specific
agnostic boosting algorithm [20]. Their boosting algorithm is based on a smooth version of Adaboost
[11] by Domingo and Watanabe [7] and Servedio [32] and uses an equivalent of our "balancing" step.
It requires a similar number of boosting stages and running time as our ABoost algorithm. They also
show an analogous application to agnostic learning of decision trees (see Lemma 3.3). In addition, Kalai
and Kanade give a simpler version of the agnostic halfspace learning algorithm of Kalai et al. [21] and
include results from an empirical evaluation of their algorithm.
2 Preliminaries



Let $X$ denote some fixed domain and let $\mathcal{F}_1^\infty$ denote the set of all functions from $X$ to $[-1,1]$ (that
is, all the functions with $L_\infty$ norm bounded by 1). It will be convenient to view a distribution $D$ over
$X$ as defining the product $\langle \phi, \psi \rangle_D = \mathbf{E}_{x \sim D}[\phi(x) \cdot \psi(x)]$ over the space of real-valued functions on $X$.
It is easy to see that this is simply a non-negatively weighted version of the standard dot product
over $\mathbb{R}^X$ and hence is a positive semi-inner product over $\mathbb{R}^X$. The corresponding norm is defined as

$$\|\phi\|_D = \sqrt{\mathbf{E}_D[\phi^2(x)]} = \sqrt{\langle \phi, \phi \rangle_D}.$$

2.1 Agnostic Learning 

The agnostic learning model was introduced by Haussler [16] and Kearns et al. [25] in order to model
situations in which the assumption that examples are labeled by some $f \in C$ does not hold. In its least
restricted version the examples are generated from some unknown distribution $A$ over $X \times \{-1,1\}$. The
goal of an agnostic learning algorithm for a concept class $C$ is to produce a hypothesis whose error on
examples generated from $A$ is close to the best possible by a concept from $C$. Any distribution $A$ over
$X \times \{-1,1\}$ can be described uniquely by its marginal distribution $D$ over $X$ and the expectation of
$b$ given $x$. That is, we refer to a distribution $A$ over $X \times \{-1, 1\}$ by a pair $(D_A, \phi_A)$ where
$D_A(z) = \Pr_{(x,b) \sim A}[x = z]$ and

$$\phi_A(z) = \mathbf{E}_{(x,b) \sim A}[b \mid x = z].$$

Formally, for a Boolean function $h$ and a distribution $A = (D, \phi)$ over $X \times \{-1,1\}$, we define

$$\Delta(A, h) = \Pr_{(x,b) \sim A}[h(x) \ne b].$$

We will frequently use the following simple equality: $\Delta(A, h) = (1 - \langle \phi, h \rangle_D)/2$. For a concept class $C$,
define $\Delta(A, C) = \inf_{h \in C}\{\Delta(A, h)\}$.
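
As an illustration of the equality above, the following minimal Python sketch computes the error of a Boolean hypothesis on a small finite domain in two ways: directly from the label probabilities, and through the inner product. The three-point domain and the names `inner`, `error`, `direct` and `via_identity` are assumptions made for this example only.

```python
def inner(phi, psi, D):
    """<phi, psi>_D = E_{x~D}[phi(x) * psi(x)] over a finite domain."""
    return sum(D[x] * phi[x] * psi[x] for x in D)

def error(D, phi, h):
    """Delta(A, h) = Pr_{(x,b)~A}[h(x) != b] for A = (D, phi), computed directly."""
    total = 0.0
    for x in D:
        p_plus = (1 + phi[x]) / 2  # Pr[b = +1 | x], since phi(x) = E[b | x]
        # h errs with probability Pr[b = +1 | x] if it predicts -1, else Pr[b = -1 | x]
        total += D[x] * (p_plus if h[x] == -1 else 1 - p_plus)
    return total

# Hypothetical toy instance: marginal D, label expectation phi, Boolean hypothesis h.
D = {0: 0.5, 1: 0.25, 2: 0.25}
phi = {0: 0.6, 1: -1.0, 2: 0.0}
h = {0: 1, 1: 1, 2: -1}

direct = error(D, phi, h)
via_identity = (1 - inner(phi, h, D)) / 2
print(direct, via_identity)
```

Both quantities agree because, for a Boolean $h$, the probability of an error at $x$ is $(1 - \phi(x) h(x))/2$.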

Kearns et al. [25] define agnostic learning as follows. 

Definition 2.1 An algorithm $\mathcal{A}$ agnostically learns a concept class $C$ by a representation class $H$ if for
every $\epsilon > 0$, $\delta > 0$, and distribution $A$ over $X \times \{-1, 1\}$, $\mathcal{A}$, given access to examples drawn randomly from
$A$, outputs, with probability at least $1 - \delta$, a hypothesis $h \in H$ such that $\Delta(A, h) \le \Delta(A, C) + \epsilon$.

As in PAC learning, the learning algorithm is efficient if it runs in time polynomial in $1/\epsilon$, $\log(1/\delta)$
and n. Here and elsewhere when not noted otherwise, we use n as a bound on the description length of 
every concept in C and also the dimension of the domain. 

In the distribution-specific version of this model, learning is only required for every $A = (D, \phi)$,
where $D$ equals some fixed distribution known in advance.

In order to define boosting in the agnostic setting we use the following definitions from [22]. For
$0 \le \beta \le 1/2$ we say that a learning algorithm is $\beta$-optimal agnostic if for every $\epsilon > 0$ the algorithm
produces a hypothesis $h$ such that $\Delta(A, h) \le \Delta(A, C) + \beta + \epsilon$. We note that Ben-David et al. [2] use
a slightly stronger notion of $\beta$-optimality that does not have the extra $\epsilon$, but this will not be significant
for our discussion.

For $0 < \gamma \le \alpha \le 1/2$ we say that a learning algorithm is $(\alpha, \gamma)$-weak agnostic if the algorithm
produces a hypothesis $h$ such that $\Delta(A, h) \le 1/2 - \gamma$ whenever $\Delta(A, C) \le 1/2 - \alpha$.

For convenience when discussing weak agnostic learning we also define $\Gamma(A, h) = 1/2 - \Delta(A, h)$ and
$\Gamma(A, C) = 1/2 - \Delta(A, C)$ accordingly. A weak agnostic learning algorithm is an algorithm that can recover
at least a polynomial fraction of the advantage over random guessing of the best approximating
function in $C$. Specifically, it produces a hypothesis $h$ such that $\Gamma(A, h) \ge p(1/n, \Gamma(A, C))$ for some
polynomial $p(\cdot, \cdot)$.

3 Agnostic Boosting 

The main component of the agnostic boosting algorithm in the work of Kalai et al. [22] is a more general
algorithm that boosts every $(\alpha, \gamma)$-weak agnostic learner to an $\alpha$-optimal agnostic learner. We first show
a weaker algorithm that boosts any $(\alpha, \gamma)$-weak agnostic learner to a $2\alpha$-optimal agnostic learner (but
is sufficient for boosting a weak agnostic learning algorithm to a strong one).

Theorem 3.1 There exists an algorithm A2boost that for every concept class $C$ and distribution $D$
over $X$, given an $(\alpha, \gamma)$-weak agnostic learning algorithm $\mathcal{A}$ for $C$ over $D$, agnostically and $2\alpha$-optimally
learns $C$ over $D$. Further, A2boost invokes $\mathcal{A}$ $O(\gamma^{-2})$ times and runs in time $\mathrm{poly}(T, 1/\gamma, 1/\epsilon)$, where
$T$ is the running time of $\mathcal{A}$.

Proof: Let $A = (D, \phi)$ be the target distribution over examples. Our algorithm performs a form
of gradient descent to the unknown target function $\phi$, where the weak agnostic learner provides the
equivalent of gradient computation.

We start with the hypothesis $h_0 \equiv 0$. Let $h_i \in \mathcal{F}_1^\infty$ be the current hypothesis. We run the algorithm
$\mathcal{A}$ on examples from $A_i = (D, (\phi - h_i)/2)$. Note that $(\phi - h_i)/2 \in \mathcal{F}_1^\infty$ and therefore this is possible.
If a hypothesis $g$ with error of at most $1/2 - \gamma$ is output by $\mathcal{A}$, we update $h_i$ using $g$ in the way we
describe later. Otherwise, we test the error of $-\mathrm{sign}(h_i)$ on distribution $A_i$. If the error is at most
$1/2 - \epsilon/2$ then we update $h_i$ using $-\mathrm{sign}(h_i)$. We refer to this update as balancing. If neither of
these conditions holds, the algorithm stops and outputs $\mathrm{sign}(h_i)$ as its final hypothesis. To reduce the
number of potentially more expensive invocations of the weak learner we perform balancing steps until
$\Delta(A_i, -\mathrm{sign}(h_i)) > 1/2 - \epsilon/2$.

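To update $h_i$ using a function $g_i$ (which is either $g$ or $-\mathrm{sign}(h_i)$) that has error $1/2 - \gamma_i$, we add
$\gamma_i \cdot g_i$ to $h_i$ and then truncate all the values outside of $[-1, 1]$. Namely, we set $h'_{i+1} = h_i + \gamma_i \cdot g_i$ and let
$h_{i+1} = P_1(h'_{i+1})$, where

$$P_1(a) = \begin{cases} a & \text{if } |a| \le 1 \\ \mathrm{sign}(a) & \text{otherwise.} \end{cases}$$

We note that when $g_i = -\mathrm{sign}(h_i)$ the projection step $P_1$ will not be necessary since $h'_{i+1} \in \mathcal{F}_1^\infty$.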

We first prove that this process will terminate after at most $O(\gamma^{-2})$ invocations of the weak learner
and $O(\epsilon^{-2})$ balancing steps. To show this we prove that in each step $h_i$ gets closer to $\phi$ by at least $3\gamma_i^2$.
Specifically, we claim that

$$\|\phi - h_{i+1}\|_D^2 \le \|\phi - h_i\|_D^2 - 3\gamma_i^2.$$

By the definition, $g_i$ has error $1/2 - \gamma_i$ on $A_i$. This implies that $\langle \phi - h_i, g_i \rangle_D \ge 2\gamma_i$. Therefore

$$\|\phi - h'_{i+1}\|_D^2 = \|\phi - (h_i + \gamma_i g_i)\|_D^2 = \|\phi - h_i\|_D^2 - 2\gamma_i \langle \phi - h_i, g_i \rangle_D + \gamma_i^2 \|g_i\|_D^2 \le \|\phi - h_i\|_D^2 - 4\gamma_i^2 + \gamma_i^2 = \|\phi - h_i\|_D^2 - 3\gamma_i^2.$$

We now observe that the projection step can only decrease the distance to $\phi$, in other words
$\|\phi - P_1(h'_{i+1})\|_D \le \|\phi - h'_{i+1}\|_D$. This follows easily from the fact that for any value $b \in [-1, 1]$ and any real
value $a$, $(b - P_1(a))^2 \le (b - a)^2$. Hence, $\|\phi - h_{i+1}\|_D^2 \le \|\phi - h'_{i+1}\|_D^2$.

By the definition, after each successful invocation of the weak learner $\gamma_i \ge \gamma$, and at most one
unsuccessful invocation of the weak learner is performed for every successful one. Similarly, in each
balancing update $\gamma_i \ge \epsilon/2$, and at most two tests of the error of $-\mathrm{sign}(h_i)$ are performed for each
balancing update. In addition, $\|\phi - h_0\|_D = \|\phi\|_D \le 1$ and therefore the process has to terminate after
at most $2\gamma^{-2}/3$ invocations of the weak learner and $4\epsilon^{-2}/3$ balancing updates.

We now need to prove that the final hypothesis $h = \mathrm{sign}(h_t)$ satisfies $\Delta(A, h) \le \Delta(A, C) + 2\alpha + \epsilon$. Let
$c \in C$ be the function such that $\Delta(A, C) = \Delta(A, c)$, or $\langle \phi, c \rangle_D = 1 - 2\Delta(A, C)$. By the definition of the
final hypothesis $h$, we know that our boosting algorithm has not received a weak hypothesis with error
$\le 1/2 - \gamma$. By the property of $(\alpha, \gamma)$-weak agnostic learning this implies that $\Delta(A_t, C) \ge \Delta(A_t, c) >
1/2 - \alpha$, or $\langle (\phi - h_t)/2, c \rangle_D < 2\alpha$. This gives us that $\langle h_t, c \rangle_D > \langle \phi, c \rangle_D - 4\alpha = 1 - 2\Delta(A, C) - 4\alpha$ (*).

In addition, we know that the error of $-\mathrm{sign}(h_t)$ on $A_t$ is at least $1/2 - \epsilon/2$. That is,
$\langle (\phi - h_t)/2, -\mathrm{sign}(h_t) \rangle_D \le \epsilon$. This gives $\langle \phi, \mathrm{sign}(h_t) \rangle_D \ge \langle h_t, \mathrm{sign}(h_t) \rangle_D - 2\epsilon$. We observe that
$\langle h_t, \mathrm{sign}(h_t) \rangle_D \ge \langle h_t, c \rangle_D$ and combine this with (*) to obtain

$$\langle \phi, h \rangle_D = \langle \phi, \mathrm{sign}(h_t) \rangle_D \ge \langle h_t, \mathrm{sign}(h_t) \rangle_D - 2\epsilon \ge \langle h_t, c \rangle_D - 2\epsilon \ge 1 - 2\Delta(A, C) - 4\alpha - 2\epsilon.$$

Therefore $\Delta(A, h) = (1 - \langle \phi, h \rangle_D)/2 \le \Delta(A, C) + 2\alpha + \epsilon$.

Finally, we note that we assumed that $\gamma_i$ is known to the boosting algorithm. It is easy to see that we
can use an appropriate estimate in the analysis above. Specifically, we use random samples to estimate
the error of the weak hypothesis $g$ within $\gamma/4$. If the estimate is smaller than $1/2 - 3\gamma/4$ we update $h_i$
using $g$ with the empirical estimate in place of the true value $\gamma_i$. It is easy to see that in this case the
distance will be reduced by at least $15\gamma^2/16$. The error estimate of $-\mathrm{sign}(h_i)$ is treated analogously. □
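
The procedure in the proof can be sketched in Python on a small finite domain. This is only a sketch under stated assumptions, not the implementation analyzed above: all expectations are computed exactly instead of being estimated from samples, the weak learner is called in every round rather than only after balancing fails, and the weak learner is a hypothetical exhaustive search over a tiny class (three dictator functions and their negations), which makes it $(\gamma, \gamma)$-weak. The names `a2boost`, `erm_weak_learner` and the toy data are illustrative.

```python
def sign(v):
    return 1.0 if v >= 0 else -1.0

def clip(v):
    """Projection P_1: truncate a value to [-1, 1]."""
    return max(-1.0, min(1.0, v))

def inner(phi, psi, D):
    """<phi, psi>_D over a finite domain indexed 0..len(D)-1."""
    return sum(d * a * b for d, a, b in zip(D, phi, psi))

def error(D, phi, h):
    """Delta((D, phi), h) = (1 - <phi, h>_D) / 2."""
    return (1.0 - inner(phi, h, D)) / 2.0

def erm_weak_learner(concepts):
    """Hypothetical weak learner: returns the concept with the smallest error."""
    def learner(D, phi_i):
        return min(concepts, key=lambda c: error(D, phi_i, c))
    return learner

def a2boost(D, phi, weak_learner, gamma, eps):
    h = [0.0] * len(D)                      # h_0 = 0
    while True:
        # Labels for stage i have expectation (phi - h_i) / 2.
        phi_i = [(p - v) / 2.0 for p, v in zip(phi, h)]
        g = weak_learner(D, phi_i)
        step = 0.5 - error(D, phi_i, g)     # gamma_i, the advantage of g on A_i
        if step < gamma:
            # Weak learner failed: try the balancing hypothesis -sign(h_i).
            g = [-sign(v) for v in h]
            step = 0.5 - error(D, phi_i, g)
            if step < eps / 2.0:
                return [sign(v) for v in h]  # neither update applies: stop
        h = [clip(v + step * gv) for v, gv in zip(h, g)]

# Toy instance (assumed for illustration): uniform D over {0,1}^3.
points = range(8)
D = [1.0 / 8] * 8
dictators = [[1.0 - 2.0 * ((x >> j) & 1) for x in points] for j in range(3)]
concepts = dictators + [[-v for v in c] for c in dictators]
phi = [0.9, 0.8, -0.2, 0.7, -0.6, -0.9, 0.3, -0.8]
gamma, eps = 0.1, 0.1

h_final = a2boost(D, phi, erm_weak_learner(concepts), gamma, eps)
best_in_class = min(error(D, phi, c) for c in concepts)
boosted_error = error(D, phi, h_final)
print(best_in_class, boosted_error)
```

Under these assumptions the guarantee of Theorem 3.1 (with $\alpha = \gamma$) says that `boosted_error` is at most `best_in_class + 2 * gamma + eps`.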

We can now show that efficient weak agnostic learning with respect to distribution D implies efficient 
agnostic learning with respect to distribution D. 

Theorem 3.2 Let C be a concept class and D be a distribution over X such that C is efficiently weakly 
agnostically learnable over D. Then C is efficiently agnostically learnable over D. 

Proof: By the definition, a weak agnostic learning algorithm gives a $(\tau, p(1/n, \tau))$-weak agnostic learning
algorithm for every $\tau$ and some fixed polynomial $p(\cdot, \cdot)$. By boosting an $(\epsilon/3, p(1/n, \epsilon/3))$-weak agnostic
learning algorithm using A2boost with the accuracy parameter set to $\epsilon/3$ we obtain an algorithm that
outputs a hypothesis with performance $\Delta(A, C) + \epsilon$, in other words a strong agnostic learning algorithm.
Note that the marginal distributions used in every stage of boosting are the same as in the original
problem and the running time is polynomial in $n$ and $1/\epsilon$. □
An immediate application of Theorem 3.2 is a simple proof of agnostic learnability of decision trees
over the uniform distribution using membership queries, which was recently obtained by Gopalan et
al. [14].

Lemma 3.3 Let $C_s$ be the concept class of decision trees of size $s$ over $\{0,1\}^n$. $C_s$ is agnostically
learnable over the uniform distribution using membership queries in time polynomial in $n$, $s$ and
$1/\epsilon$.

Proof: As has been shown by Kushilevitz and Mansour, the $L_1$ norm of the Fourier representation
of a decision tree of size $s$ is at most $s$ [29]. Namely, if $c$ is a decision tree of size $s$ then
$L_1(c) = \sum_{a \in \{0,1\}^n} |\hat{c}(a)| \le s$, where $\hat{c}(a)$ is the Fourier coefficient of $c$ with index $a$. Now let $U$ be the uniform
distribution over $\{0, 1\}^n$, $\phi \in \mathcal{F}_1^\infty$ and let $A = (U, \phi)$. If $\Delta(A, C_s) \le 1/2 - \tau$ then there exists $c \in C_s$
such that $\langle c, \phi \rangle_U \ge 2\tau$. But $c = \sum_{a \in \{0,1\}^n} \hat{c}(a) \chi_a(x)$ and, in particular,

$$\langle c, \phi \rangle_U = \sum_{a \in \{0,1\}^n} \hat{c}(a) \langle \chi_a(x), \phi \rangle_U \ge 2\tau.$$

This implies that there exists $a'$ such that $|\langle \chi_{a'}(x), \phi \rangle_U| \ge 2\tau/L_1(c) \ge 2\tau/s$. Therefore $\Delta(A, \chi_{a'}) \le
1/2 - \tau/s$ or $\Delta(A, -\chi_{a'}) \le 1/2 - \tau/s$. This implies that an agnostic learning algorithm for parity with
$\epsilon = \tau/(2s)$ is also a weak agnostic learner for decision trees of size $s$. An agnostic learning algorithm for
a parity function over the uniform distribution using membership queries was given by Goldreich
and Levin [13] (see also [29]). To finish the proof we simply need to apply Theorem 3.2. □
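
The averaging argument in this proof can be checked by brute force on a tiny example. The Python sketch below is an illustration under assumptions, not part of the proof: it uses a hypothetical four-leaf decision tree on three bits and an arbitrary label expectation `phi`, computes all Fourier coefficients by enumeration (rather than by the Goldreich-Levin or Kushilevitz-Mansour algorithms), and compares the best parity correlation with $\langle c, \phi \rangle_U / L_1(c)$.

```python
from itertools import product

n = 3
points = list(product([0, 1], repeat=n))

def chi(a, x):
    """Parity chi_a(x) = (-1)^(a . x) with values in {-1, 1}."""
    return -1 if sum(ai * xi for ai, xi in zip(a, x)) % 2 else 1

def parity(a):
    return lambda x: chi(a, x)

def tree(x):
    """Hypothetical decision tree with s = 4 leaves: query x[0], then x[1] or x[2]."""
    queried = x[1] if x[0] == 0 else x[2]
    return -1 if queried else 1

def phi(x):
    """Assumed label expectation in [-1, 1]: mostly agrees with the tree."""
    return -0.5 * tree(x) if x == (1, 1, 1) else 0.8 * tree(x)

def corr(f, g):
    """<f, g>_U for the uniform distribution U over {0,1}^n."""
    return sum(f(x) * g(x) for x in points) / len(points)

coeffs = {a: corr(tree, parity(a)) for a in points}   # Fourier coefficients of the tree
l1 = sum(abs(v) for v in coeffs.values())              # L_1(c)
tree_corr = corr(tree, phi)                            # <c, phi>_U
best_parity_corr = max(abs(corr(phi, parity(a))) for a in points)
print(l1, tree_corr, best_parity_corr)
```

Since $\langle c, \phi \rangle_U$ is a combination of the parity correlations with total weight $L_1(c)$, some parity (or its negation) must achieve correlation at least `tree_corr / l1`.
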
We now show that a simple modification to the distributions and the potential function used in
A2boost gives an agnostic boosting algorithm from $(\alpha, \gamma)$-weak agnostic learning to $\alpha$-optimal agnostic
learning.

Theorem 3.4 (Restated from 1.1) There exists an algorithm ABoost that for every concept class $C$
and distribution $D$ over $X$, given an $(\alpha, \gamma)$-weak agnostic learning algorithm $\mathcal{A}$ for $C$ over $D$,
agnostically and $\alpha$-optimally learns $C$ over $D$. Further, ABoost invokes $\mathcal{A}$ $O(\gamma^{-2})$ times and runs in time
$\mathrm{poly}(T, 1/\gamma, 1/\epsilon)$, where $T$ is the running time of $\mathcal{A}$.

Proof: First we assume for simplicity that $\phi \equiv f$ for some Boolean $f$, that is, the examples are labeled
by a function. The proof is based on the same idea as the proof of Th. 3.1. However, in order to avoid the
double loss of $\alpha$ we use a different distribution $A_i$ at every stage. Specifically, we let $A_i = (D, P_1(f - h_i))$
while the rest of the algorithm is exactly the same. As before, in order to prove the claim we first prove
that the boosting process will terminate after at most $O(\gamma^{-2} + \epsilon^{-2})$ steps. The potential function whose
gradient is (up to a constant factor) $P_1(f - h_i)$ is defined as follows. For a real $a$ let

$$R(a) = \begin{cases} a^2 & \text{if } |a| \le 1 \\ 2|a| - 1 & \text{otherwise.} \end{cases}$$

(It is easy to see that for every $b \in \{-1, 1\}$, $dR(b - a)/da = -2P_1(b - a)$.) The potential of a function
$h \in \mathcal{F}_1^\infty$ relative to $f$ and $D$ is defined to be $\mathbf{E}_D[R(f - h)]$. We next claim that for every Boolean $f$ and
distribution $D$,

1. $\mathbf{E}_D[R(f - h_0)] = \mathbf{E}_D[R(f)] = 1$ and for any real-valued function $\psi$, $\mathbf{E}_D[R(f - \psi)] \ge 0$;

2. If $\langle P_1(f - h_i), g_i \rangle_D \ge 2\gamma_i$ then $\mathbf{E}_D[R(f - (h_i + \gamma_i g_i))] \le \mathbf{E}_D[R(f - h_i)] - 3\gamma_i^2$; to see this we simply
observe that for every point $x$,

$$R(f(x) - (h_i(x) + \gamma_i g_i(x))) - R(f(x) - h_i(x)) \le -2\gamma_i P_1(f(x) - h_i(x)) g_i(x) + (\gamma_i g_i(x))^2.$$

3. $\mathbf{E}_D[R(f - P_1(h))] \le \mathbf{E}_D[R(f - h)]$.
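
The pointwise inequality in claim 2 and the projection property in claim 3 can be spot-checked numerically. The following Python sketch evaluates both on a grid of values; the grid, the step sizes and the names `p1`, `potential` and `violations` are assumptions chosen for illustration, and a grid check is of course not a proof.

```python
def p1(a):
    """Projection P_1 onto [-1, 1]."""
    return max(-1.0, min(1.0, a))

def potential(a):
    """R(a) = a^2 if |a| <= 1, and 2|a| - 1 otherwise."""
    return a * a if abs(a) <= 1 else 2 * abs(a) - 1

grid = [k / 10.0 for k in range(-10, 11)]   # values in [-1, 1]
violations = []
for f in (-1.0, 1.0):                        # Boolean target value at a point
    for h in grid:                           # current hypothesis value at that point
        for g in grid:                       # weak hypothesis value at that point
            for step in (0.05, 0.25, 1.0):   # candidate gamma_i
                lhs = potential(f - (h + step * g)) - potential(f - h)
                rhs = -2 * step * p1(f - h) * g + (step * g) ** 2
                if lhs > rhs + 1e-12:
                    violations.append(("update", f, h, g, step))
        # Claim 3: projecting a value outside [-1, 1] never increases the potential.
        for h_unclipped in (h - 1.5, h + 1.5):
            if potential(f - p1(h_unclipped)) > potential(f - h_unclipped) + 1e-12:
                violations.append(("projection", f, h_unclipped))
print(len(violations))
```

The update inequality reflects the fact that $R$ is differentiable with derivative $2P_1$, and that this derivative changes at rate at most 2.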

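For the second part of the proof we prove that $\Delta(A, h) \le \Delta(A, C) + \alpha + \epsilon/2$. As in the previous
proof, $\langle f, c \rangle_D \ge 1 - 2\Delta(A, C)$. In addition, the stopping condition implies that $\langle P_1(f - h_t), c \rangle_D < 2\alpha$
and $\langle P_1(f - h_t), -\mathrm{sign}(h_t) \rangle_D \le \epsilon$. We now observe that the $L_1$ norm of our distribution function $P_1(f - h_t)$
is small. Namely,

$$\mathbf{E}_D[|P_1(f - h_t)|] = \mathbf{E}_D[f \cdot P_1(f - h_t)] = \mathbf{E}_D[c \cdot P_1(f - h_t)] + \mathbf{E}_D[(f - c) \cdot P_1(f - h_t)] \le 2\alpha + \mathbf{E}_D[|f - c|] \le 2\alpha + 2\Delta(A, C). \quad (1)$$

For the second step we show that

$$\Pr_D[f \ne \mathrm{sign}(h_t)] = \frac{1}{2}\mathbf{E}_D[|f - \mathrm{sign}(h_t)|] \le \frac{1}{2}\mathbf{E}_D[P_1(f - h_t)(f - \mathrm{sign}(h_t))] \le \frac{1}{2}\left(\mathbf{E}_D[|P_1(f - h_t)|] + \epsilon\right). \quad (2)$$

By combining equations (1) and (2) we get that $\Pr_D[f \ne \mathrm{sign}(h_t)] \le \Delta(A, C) + \alpha + \epsilon/2$.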

Finally, we note that if the function $\phi(x)$ is not Boolean we can reduce the analysis to the Boolean
case by treating each point $x$ as two points: one with probability $D(x)(1 + \phi(x))/2$ with the target
function equal to 1 and the other one with probability $D(x)(1 - \phi(x))/2$ and the target equal to $-1$. All
functions that we consider are treated as identical on both of these points. The hypotheses we generate
are combinations of the weak learning hypotheses and therefore will also be identical on both of these
points. □



3.1 Relation to Distribution Independent Boosting 

The boosting algorithms of this section can also be viewed in the regular setting where the boosting
algorithm uses the weak learner on artificially constructed distributions over the domain. To see this
one can observe that by outputting a random coin with expectation $(f(x) - h(x))/2$ (or $P_1(f(x) - h(x))$
in the case of ABoost) instead of $f(x)$ we reduce the contribution of the correlation on point $x$ to the
total value of correlation in the same way as the regular boosting algorithms modify the weights of the
point $x$ to reduce or increase the contribution. At the same time, as demonstrated by Ben-David et al.
[2] and Gavinsky [12], modifying the distribution does give the boosting algorithm an ability to boost
beyond an $\alpha$-optimal solution (at the expense of a stronger assumption). We demonstrate this by giving
the following version of our boosting algorithm.

Theorem 3.5 There exists an algorithm ABoostDI that for every concept class $C$ over $X$, given a
distribution-independent $(\alpha, \gamma)$-weak agnostic learning algorithm $\mathcal{A}'$ for $C$, for every distribution $A =
(D, f)$ over $X$ and $\epsilon > 0$, produces a hypothesis $h$ such that $\Pr_D[f \ne h] \le \frac{\Delta(A, C)}{1 - 2\alpha} + \epsilon$. Further, ABoostDI
invokes $\mathcal{A}'$ $O(\gamma^{-2} \Delta'^{-1} \log(1/\Delta'))$ times for $\Delta' = \Delta(A, C)/(1 - 2\alpha)$ and runs in time $\mathrm{poly}(T, 1/\gamma, 1/\epsilon)$,
where $T$ is the running time of $\mathcal{A}'$.






Proof: We first observe that for every Boolean function $f$, $h \in \mathcal{F}_1^\infty$, and $x \in X$, $P_1(f(x) - h(x)) =
f(x)|P_1(f(x) - h(x))|$. This implies that for every function $g \in \mathcal{F}_1^\infty$,

$$\mathbf{E}_D[P_1(f(x) - h(x)) g(x)] = \mathbf{E}_D[|P_1(f(x) - h(x))| f(x) g(x)] = \mathbf{E}_{D_h}[f(x) g(x)] \cdot N_h,$$

where $D_h$ is the distribution defined by the density function $D_h(x) = D(x)|P_1(f(x) - h(x))|/N_h$ and
$N_h = \mathbf{E}_D[|P_1(f(x) - h(x))|]$ is the normalization factor. Therefore if $\mathcal{A}'$ provides a hypothesis that
satisfies $\mathbf{E}_{D_h}[f(x) g(x)] \ge 2\gamma$ then $\mathbf{E}_D[P_1(f(x) - h(x)) g(x)] \ge 2\gamma \cdot N_h$. In order to bound the number of
boosting stages we need to lower bound $\gamma \cdot N_h$. To do this we note that $\mathbf{E}_D[|P_1(f - h)|] \ge \Pr[f \ne \mathrm{sign}(h)]$
and therefore we can assume that $N_h \ge \Delta'$, which is the desired final error of the boosting algorithm
that we will determine later. Otherwise the error of $\mathrm{sign}(h)$ is already sufficiently small and we can stop the
boosting (using an additional testing step). This implies that the total number of calls to the weak learner
is $O(\gamma^{-2} \Delta'^{-2})$. We can also sharpen this bound by noticing that $N_h = \mathbf{E}_D[|P_1(f - h)|] \ge \mathbf{E}_D[R(f - h)]/3$.
This implies that at every stage when the weak learner's hypothesis is used

$$\mathbf{E}_D[R(f - h_{i+1})] \le \mathbf{E}_D[R(f - h_i)] - 3(\gamma \cdot N_{h_i})^2 \le \mathbf{E}_D[R(f - h_i)](1 - \gamma^2 \cdot N_{h_i}) \le \mathbf{E}_D[R(f - h_i)](1 - \gamma^2 \cdot \Delta').$$

This implies that after at most $t = \gamma^{-2} \Delta'^{-1} \ln(\Delta'^{-1})$ steps $\Pr[f \ne \mathrm{sign}(h_t)] \le \mathbf{E}_D[R(f - h_t)] \le \Delta'$,
which implies that the boosting process will terminate.

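Finally, we need to define $\Delta'$. By the definition, $\mathcal{A}'$ returns a good weak hypothesis whenever
$\mathbf{E}_{D_h}[f(x) c(x)] \ge 2\alpha$ for some $c \in C$, and hence if the boosting stopped then
$\mathbf{E}_D[c \cdot P_1(f - h_t)] < 2\alpha \cdot N_{h_t}$. By plugging this into equation (1), we obtain that

$$N_{h_t} = \mathbf{E}_D[|P_1(f - h_t)|] \le 2\Delta(A, C) + 2\alpha \cdot N_{h_t},$$

which gives us the bound $N_{h_t} \le \frac{2\Delta(A, C)}{1 - 2\alpha}$. By plugging this into equation (2) we obtain that

$$\Pr_D[f \ne \mathrm{sign}(h_t)] \le N_{h_t}/2 + \epsilon/2 \le \frac{\Delta(A, C)}{1 - 2\alpha} + \epsilon/2.$$

We therefore set $\Delta' = \frac{\Delta(A, C)}{1 - 2\alpha} + \epsilon/2$. □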
To compare this result with the boosting algorithms in the setting of Ben-David et al. [2] we note
that, by the definition, any $\beta$-optimal agnostic learner is, in particular, a $(\beta + \gamma', \gamma'/2)$-weak agnostic
learner for any inverse-polynomial $\gamma' > 0$. Therefore ABoostDI applied to a $\beta$-optimal agnostic learner
returns a hypothesis with error of at most $\frac{\Delta(A, C)}{1 - 2(\beta + \gamma')} + \epsilon/2 \le \frac{\Delta(A, C)}{1 - 2\beta} + \epsilon$ for $\gamma' = \epsilon/(4(1 - 2\beta))$. As
demonstrated by Gavinsky, this is optimal [12].

Remark 3.6 We remark that ABoostDI does not modify the target function and therefore is also
applicable in the PAC framework. In addition, ABoostDI has the optimal smoothness. That is, when
learning to accuracy $\epsilon'$ the weight of any point under any of the distributions generated by the boosting
algorithm is at most $1/(2\epsilon' - \epsilon)$ times higher than the weight under $D$. This is true since $D_h(x) =
D(x)|P_1(f(x) - h(x))|/N_h$ and $N_h \ge 2\Pr[f \ne \mathrm{sign}(h)] - \epsilon$. The smoothness property is crucial in a number
of applications of boosting algorithms such as learning DNF over the uniform distribution [19], learning
with malicious noise [32] and the connection to hard-core set constructions.

3.2 Applications to PAC Learning 

It has long been noted that efficient agnostic learning of a concept class $C$ over a distribution $D$ implies
efficient weak learning of the class of functions expressible as low-weight linear thresholds of functions
from $C$ over $D$ [25]. It therefore follows that distribution-independent agnostic learning of $C$ implies
PAC learning of $\mathrm{TH}(W, C)$ (see Section 1.1 for the definition) for any polynomially bounded $W$. Our
goal is to strengthen this implication to distribution-specific learning.

Theorem 3.7 (Restated from 1.2) If $C$ is agnostically learnable with respect to distribution $D$ in
time $T(n, \epsilon)$ then $\mathrm{TH}(W, C)$ is PAC learnable over $D$ in time $O((W/\epsilon)^2 \cdot T(n, \epsilon/(4W)) + \mathrm{poly}(n, W, 1/\epsilon))$.
Proof: The reduction from PAC to agnostic learning relies on the discriminator lemma of Hajnal et
al. [15] stating that for every $f \in \mathrm{TH}(W, C)$ and every distribution $D$, there exists a function
$c' \in C$ such that $|\langle f, c' \rangle_D| \ge 1/W$. Now for $h \in \mathcal{F}_1^\infty$, let $D_h$ be the
distribution defined in the proof of Th. 3.5 (that is, $D_h(x) = D(x)|P_1(f(x) - h(x))|/N_h$). The
discriminator lemma implies that there exists $c'$ such that $|\langle f, c' \rangle_{D_h}| \ge 1/W$. This
implies that $|\langle f - h, c' \rangle_D| \ge N_h/W$. As we showed in Th. 3.5, $N_h \ge \Pr_D[f \ne h] \ge \epsilon$
since the desired accuracy of PAC learning is $\epsilon$. This implies that
$|\langle f - h, c' \rangle_D| \ge \epsilon/W$. Therefore when the agnostic learner for $C$ is run on the
distribution $(D, f - h)$ with $\epsilon' = \epsilon/(4W)$ it will return a function $g$ such that
$\langle f - h, g \rangle_D \ge \epsilon/(2W)$ (for simplicity we can assume that $C$ is closed under negation;
alternatively we can also run the agnostic learner on the negation of $(D, f - h)$ and negate the
result). We can therefore use $g$ in the same way as in Theorem 1.1 with $\gamma = \epsilon/(4W)$ and the
accuracy of the boosting process set to $\epsilon'' = \epsilon/2$ until the accuracy of $\mathrm{sign}(h_i)$
reaches $\epsilon$. The total number of boosting stages is $O((W/\epsilon)^2)$ and the running time is
polynomial in $1/\epsilon$, $W$ and the running time of the agnostic learner for $C$ (with
$\epsilon' = \epsilon/(4W)$). Finally we note that in this result one can also use the slightly simpler
A2boost in place of ABoost and there is no need to balance $h_i$ by testing the $-\mathrm{sign}(h_i)$
hypothesis (since those only affect the constant multiplier of the accuracy). □
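
The discriminator lemma invoked at the start of the proof is easy to check by brute force on a small
instance. The sketch below is illustrative only and makes several assumptions: the concept class $C$ is
taken to be the (hypothetical) class of single-variable functions $x \mapsto x_i$ on $\{-1, 1\}^5$, the
target $f$ is the majority of three of them (so $f$ is a threshold of weight $W = 3$ over $C$), and $D$
is an arbitrary randomly generated distribution.

```python
import itertools
import random

n, W = 5, 3
X = list(itertools.product([-1, 1], repeat=n))

# Hypothetical concept class C: the n "dictator" functions x -> x[i].
C = [lambda x, i=i: x[i] for i in range(n)]

def f(x):
    """Majority of x[0], x[1], x[2]: a weight-3 threshold over C."""
    return 1 if x[0] + x[1] + x[2] > 0 else -1

def correlation(D, u, v):
    """<u, v>_D = E_{x ~ D}[u(x) * v(x)] for a distribution D given as a list over X."""
    return sum(p * u(x) * v(x) for p, x in zip(D, X))

def best_discriminator(D):
    """Largest |<f, c>_D| over c in C."""
    return max(abs(correlation(D, f, c)) for c in C)

rng = random.Random(1)
raw = [rng.random() for _ in X]
total = sum(raw)
D = [r / total for r in raw]  # an arbitrary distribution over X

uniform = [1.0 / len(X)] * len(X)
```

In this instance the bound holds for every $D$ because $f(x) \cdot (x_0 + x_1 + x_2) \ge 1$ pointwise, so
the three correlations $\langle f, x_i \rangle_D$ sum to at least 1 and one of them is at least $1/W$.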

One application of such a result is that it gives a DNF learning algorithm directly from a uniform
distribution agnostic parity learning algorithm (such as the Kushilevitz-Mansour algorithm [29]) without
the need for a specialized analysis of the Fourier Transform of distribution functions given by Jackson
[19] and used in many subsequent works. This follows from the fact that polynomial size DNF formulas
can be represented as low-weight thresholds of parities (or TOP) [19]. We further note that the resulting
algorithm is the same as that obtained in Lemma 3.3 (up to the setting of the parameters).

A related corollary of this result is that agnostic learning of DNF formulas from membership queries
over the uniform distribution (see [14] for the problem definition) would imply (strong) learning of
depth-3 circuits (even with a majority gate at the top) with membership queries.



4 Relation to Hard-core Set Construction 

We start with a couple of definitions relevant to hard-core set construction. We say that a function
$f$ is $\lambda$-hard for size $s$ if for every circuit $z$ of size at most $s$,
$\Pr_U[f(x) \ne z(x)] \ge \lambda$, where $U$ denotes the uniform distribution over $X$. A measure $M$ over
$X$ is a function from $X$ to $[0, 1]$. The density of a measure $M$ is defined to be
$\mu(M) = (\sum_{x \in X} M(x))/|X|$. We say that $f$ is $\gamma$-hard-core on $M$ for size $s$ if for every
circuit $z$ of size at most $s$, $\Pr_{U_M}[f(x) \ne z(x)] \ge 1/2 - \gamma$, where $U_M$ is the distribution
on $X$ with density function $U_M(x) = M(x)/(|X| \cdot \mu(M))$. Similarly, we say that $f$ is
$\gamma$-hard-core on a set $S \subseteq X$ for size $s$ if $f$ is $\gamma$-hard-core on $M_S$ for size $s$,
where $M_S(x)$ is the characteristic function of $S$. It is well-known that in order to construct a
hard-core set of size $\delta \cdot |X|$ it is sufficient to construct a hard-core measure of density at
least $\delta$ [18]. All known uniform constructions of a measure for which $f$ is $\gamma$-hard-core
are essentially boosting algorithms that construct a sequence of measures $M_0, M_1, \ldots$, each of
density at least $\delta$, such that if $f$ is not $\gamma$-hard-core on $M_i$ for size $s$ then a circuit
$z_i$ that $(1/2 - \gamma)$-approximates $f$ on $M_i$ is used to create $M_{i+1}$. If this process does not
stop after a certain number of steps then the $z_i$'s can be combined to obtain a circuit that
$\lambda$-approximates $f$. This contradicts $\lambda$-hardness of $f$ for some size $s'$ and therefore
implies that $f$ is $\gamma$-hard-core on one of the constructed measures. An important parameter of such
a construction is the density $\delta$ as a function of $\lambda$ (and sometimes $\gamma$). Impagliazzo
showed a construction of a hard-core set of size $\lambda$ and asked whether the optimal size of
$2\lambda$ is achievable [18]. Holenstein [17] gave the first construction with the optimal hard-core set
size parameter on the basis of Impagliazzo's hard-core set construction [18]. In a recent work Barak et
al. gave a more efficient construction based on multiplicative updates and Bregman projections [1].
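
To make the definitions of a measure, its density and the induced distribution concrete, here is a
minimal sketch on a domain of 8 points. It assumes that $U_M$ assigns each point probability mass
proportional to $M(x)$, normalized to sum to one; the function names and the example measure are
hypothetical.

```python
from fractions import Fraction

def density(M):
    """mu(M) = (sum over x of M(x)) / |X| for a measure M: X -> [0, 1]."""
    return sum(M) / len(M)

def induced_distribution(M):
    """Probability mass of each point under U_M, i.e. M(x) / (|X| * mu(M))."""
    total = sum(M)
    if total == 0:
        raise ValueError("measure has density 0")
    return [m / total for m in M]

def set_measure(S, size):
    """Characteristic function M_S of a set S of indices in range(size)."""
    return [Fraction(1) if x in S else Fraction(0) for x in range(size)]

# Hypothetical measure on a domain of 8 points (exact rational arithmetic).
M = [Fraction(1), Fraction(1, 2), Fraction(0), Fraction(1, 4),
     Fraction(1), Fraction(0), Fraction(1, 4), Fraction(0)]
U_M = induced_distribution(M)
mu = density(M)
# Ratio of U_M(x) to the uniform weight 1/|X|; pointwise it equals M(x) / mu(M).
max_ratio = max(p * len(M) for p in U_M)
```

Because $M(x) \le 1$, no point receives more than $1/\mu(M)$ times its uniform weight, and the bound is
attained here since some point has $M(x) = 1$.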

It is easy to see that the distribution $U_M$ is $1/\delta$-smooth if and only if the measure $M$ has
density $\delta$ [27]. Gavinsky has demonstrated that the smoothness of distributions produced by a
boosting algorithm determines the error that the boosting algorithm will achieve when boosting a
(distribution-independent) $\beta$-optimal agnostic learner [12]. Our goal is to combine these
observations in the context of distribution-specific agnostic boosting. Namely we are going to show
that hard-core set constructions that achieve the optimal set size of $2\lambda$ are
(distribution-specific) agnostic boosting algorithms. We start with a formal statement of the hard-core
set lemma with the optimal set size parameter.

Lemma 4.1 ([17, 1]) Let $f$ be a Boolean function over a domain $X$, $s$ be an integer, $\delta > 0$ and
$\gamma > 0$. Suppose there exists an algorithm $\mathcal{A}$ which for any measure $M$ over $X$ of density
$\delta$, given access to random examples from distribution $(U_M, f)$, returns a circuit $z$ of size at
most $s$ such that $\Pr_{U_M}[z(x) \ne f(x)] \le 1/2 - \gamma$. Then there is an algorithm $\mathcal{B}$ which
for every $f$, given access to random examples from distribution $(U, f)$, with probability at least
$1/2$ (over the internal randomness of $\mathcal{B}$) returns a circuit $z'$ of size $s'$ such that
$\Pr_U[z'(x) \ne f(x)] \le \delta/2$. Furthermore, the algorithm $\mathcal{B}$ invokes $\mathcal{A}$
$t(1/\delta, 1/\gamma)$ times, requires time $m(1/\delta, 1/\gamma, s, T)$, and $s'$ is linear in
$t(1/\delta, 1/\gamma)$ and $s$, where $T$ is the time to simulate $\mathcal{A}$, and $t(\cdot, \cdot)$ and
$m(\cdot, \cdot)$ are fixed polynomials.

First, while this lemma is stated for the uniform distribution on X it is known and easy to verify 
directly that it also holds for any distribution D over X (implicit in pQ and in general can be obtained 
by taking a sufficiently large sample from D and viewing a uniform distribution over the sample). 
Namely, the lemma holds even if one replaces the uniform distribution by $D$, density by density relative
to $D$, that is $\mu_D(M) = \mathbf{E}_D[M(x)]$, and $U_M$ with the distribution $D_M$ defined by density
function $D_M(x) = D(x)M(x)/\mu_D(M)$.

Now let $C$ be a concept class, $A = (D, f)$ be a distribution over examples and $\mathcal{A}'$ be an
$(\alpha, \gamma)$-weak agnostic learner for $C$ over $D$. To obtain an agnostic boosting algorithm we
replace the algorithm $\mathcal{A}$ in the lemma with $\mathcal{A}'$. To do this we generate examples from
distribution $A_M = (D, f \cdot M)$ and run $\mathcal{A}'$ on them. We then return the hypothesis $g$ given
by $\mathcal{A}'$ to the algorithm $\mathcal{B}$. We claim that if $\mu_D(M) \ge 2(\Delta(A, C) + \alpha)$
then $\Pr_{D_M}[g(x) \ne f(x)] \le 1/2 - \gamma$. Observe that if the claim holds then the execution of
$\mathcal{B}$ will produce a hypothesis $z'$ such that $\Pr_D[z'(x) \ne f(x)] \le \Delta(A, C) + \alpha$ as
desired. At the same time the running time is polynomial in the relevant parameters of agnostic
learning (in fact it does not even depend on the accuracy $\epsilon$).

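To establish the claim we let $c$ be the concept such that $\Delta(A, C) = \Delta(A, c) = \Pr_D[f \ne c]$. Now

$$\mathbf{E}_D[c \cdot (f \cdot M)] = \mathbf{E}_D[f \cdot (f \cdot M)] + \mathbf{E}_D[(c - f) \cdot (f \cdot M)] \ge \mathbf{E}_D[M] - 2\Pr_D[c \ne f] = \mu_D(M) - 2 \cdot \Delta(A, C).$$

Therefore if $\mu_D(M) \ge 2(\Delta(A, C) + \alpha)$ then $\mathbf{E}_D[c \cdot (f \cdot M)] \ge 2\alpha$, or
$\Delta(A_M, c) \le 1/2 - \alpha$. Hence, by the definition of $\mathcal{A}'$, it will return a hypothesis $g$
such that $\mathbf{E}_D[g \cdot (f \cdot M)] \ge 2\gamma$. This gives us that

$$\mathbf{E}_{D_M}[g \cdot f] = \mathbf{E}_D[g \cdot (f \cdot M)]/\mu_D(M) \ge 2\gamma/\mu_D(M) \ge 2\gamma,$$

or $\Pr_{D_M}[g(x) \ne f(x)] \le 1/2 - \gamma$.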
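
The two displayed relations can be sanity-checked numerically. The sketch below is only an illustration
on hypothetical random data: it draws a distribution $D$, a Boolean target $f$, a concept $c$ that
mostly agrees with $f$, an arbitrary hypothesis $g$ and a measure $M$, then computes both sides of each
relation. The function names are assumptions made for this example.

```python
import random

def expectation(D, values):
    """E_{x ~ D}[values(x)] on a finite domain."""
    return sum(p * v for p, v in zip(D, values))

def check_claim(D, f, c, g, M):
    """Return the quantities appearing in the two displayed relations."""
    mu_D = expectation(D, M)                                   # mu_D(M)
    err_c = sum(p for p, fx, cx in zip(D, f, c) if fx != cx)   # Pr_D[f != c]
    corr_c = expectation(D, [cx * fx * m for cx, fx, m in zip(c, f, M)])
    corr_g = expectation(D, [gx * fx * m for gx, fx, m in zip(g, f, M)])
    # D_M(x) = D(x) * M(x) / mu_D(M)
    D_M = [p * m / mu_D for p, m in zip(D, M)]
    corr_g_DM = expectation(D_M, [gx * fx for gx, fx in zip(g, f)])
    return mu_D, err_c, corr_c, corr_g, corr_g_DM

rng = random.Random(2)
N = 16
raw = [rng.random() + 0.05 for _ in range(N)]
D = [r / sum(raw) for r in raw]
f = [rng.choice([-1, 1]) for _ in range(N)]
c = [fx if rng.random() < 0.8 else -fx for fx in f]   # concept close to f
g = [rng.choice([-1, 1]) for _ in range(N)]           # arbitrary hypothesis
M = [rng.random() for _ in range(N)]                  # measure with values in [0, 1]

mu_D, err_c, corr_c, corr_g, corr_g_DM = check_claim(D, f, c, g, M)
```

The inequality holds because $(c - f) \cdot f$ equals $-2$ exactly where $c$ and $f$ disagree and $0$
elsewhere, while $M(x) \le 1$. The rescaling by $1/\mu_D(M)$ can only increase a nonnegative
correlation, since $\mu_D(M) \le 1$.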

Finally, we also observe that ABoostDI gives a hard-core set construction that achieves essentially
optimal hard-core set size. As we showed in Remark 3.6, when learning to accuracy $\epsilon'$ the
algorithm ABoostDI produces $1/(2\epsilon' - \epsilon)$-smooth distributions for any $\epsilon$ (in time
polynomial in $1/\epsilon$). By the observation of Klivans and Servedio [27], this implies that when used
for hard-core set constructions the algorithm will produce a hard-core set of size
$2\epsilon' - \epsilon$ for any $\epsilon$ in time polynomial in $1/\epsilon$.

5 Conclusions 

We demonstrated that in the agnostic learning framework strong learning with respect to a specific 
distribution can be efficiently reduced to weak learning with respect to the same distribution. Further, 
we showed that this can be done using a variety of methods, some new and simple ones given here but 
also via a simple adaptation of two known algorithms [17, 1] (and yet another method was just discovered
independently [20]). In our opinion these findings testify that boosting in the agnostic learning framework
is at least as natural and powerful a phenomenon as it is in the PAC model. The agnostic learning model
reflects many of the practical scenarios more faithfully than the PAC model [TB] and hence we suggest 
that the agnostic learning framework is better suited for theoretical analysis of boosting algorithms. 
One piece of evidence for this is that the main reason why the AdaBoost algorithm [11] is known not to
cope well with noise is that it places too much weight on the noisy examples [6]; in other words, it is
not smooth. As can be seen from our work (and from [12]), achieving the strong agnostic guarantees forces
the boosting algorithm to be optimally smooth.

In this version of the results we have omitted the discussion of the circuits that combine the weak 
hypotheses and also detailed bounds on the running time of our boosting algorithms. In part, this is 
because our algorithms do not improve on the agnostic boosting algorithm derived from the algorithm 
of Barak et al. [1] that achieves the optimal number of boosting stages and uses a simple majority 
to combine the weak hypotheses. While the performance of ABoost in the distribution-specific setting 
is essentially the same, our algorithm uses a more complex circuit to combine the weak hypotheses 
(primarily because of the "balancing" step). In addition, this allows us to simplify the presentation of 
the core ideas of the algorithm. 